In this paper the Zipf-Mandelbrot law is revisited in the context of linguistics. Despite its
widespread popularity the Zipf-Mandelbrot law can only describe the statistical behaviour of a 
rather restricted fraction of the total number of words contained in some given corpus. In particu- 
lar, we focus our attention on the important deviations that become statistically relevant as larger 
corpora are considered and that ultimately could be understood as salient features of the underlying 
complex process of language generation. Finally, it is shown that all the different observed regimes 
can be accurately encompassed within a single mathematical framework recently introduced by C. 
Tsallis. 
I. INTRODUCTION

In 1932 George Zipf put forward an empirical obser- 
vation on certain statistical regularities of human writ- 
ings that has become the most well known statement of 
quantitative linguistics. He found that in a given text 
corpus there is an approximate mathematical relation be- 
tween the frequency of occurrence of each word and its 
rank in the list of all the words used in the text ordered 
by decreasing frequency. He also pointed out similar
relations that hold in other contexts as well; however,
in this work we shall concentrate on its applications
in linguistics.

Let us identify a particular word by an index s equal
to its rank, and by f(s) the normalised frequency of
occurrence of that word, that is, the number of times it
appears in the text divided by the total number of words
N. Then, Zipf's law states that the following relation
holds approximately:

\[ f(s) = \frac{A}{s^{\alpha}} \qquad (1) \]

where the exponent α takes on a value slightly greater
than 1, and A is a normalising constant. Although this
is a strong quantitative statement with ubiquitous
applicability, attested over a vast repertoire of human
languages, some observations are in order. First, Zipf's law
in its original form, as we have written it, can at most
account for the statistical behaviour of word frequencies
in a rather limited zone in the middle-low to low range
of the rank variable. Even in the case of long single texts
Zipf's law renders an acceptable fit only in the small window
between s ≈ 100 and s ≈ 2000, which does not represent
a significant fraction of any literary vocabulary. Second,
the modification introduced by Mandelbrot by using
arguments on the fractal structure of lexical trees, though
valuable in terms of possible insights into the statistical
manifestations derived from the hierarchical structure of
languages, does not have a notable impact on the quantitative
agreement with empirical data. In fact, the only
improvement over the original form of the law is that it fits
more adequately the region corresponding to the lowest
ranks, that is s < 100, dominated by function words. The
generalised form proposed by Mandelbrot can be written
as follows:

\[ f(s) = \frac{A}{(1 + C s)^{\alpha}} \qquad (2) \]

where C is a second parameter that needs to be adjusted
to fit the data.
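
As a minimal sketch of the quantities defined above, and not the procedure used to produce the results of this paper, the following Python fragment builds the normalised rank-frequency list f(s) from raw text and evaluates the Zipf-Mandelbrot form A/(1 + C s)^α. The tokenisation rule, the function names and the sample sentence are illustrative assumptions.

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return the normalised frequencies f(s), ordered by rank s = 1, 2, ..."""
    # Assumed tokenisation: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)  # N, the total number of tokens
    # most_common() orders words by decreasing count, which defines the rank.
    return [n / total for _, n in counts.most_common()]


def zipf_mandelbrot(s, A, C, alpha):
    """Zipf-Mandelbrot form A / (1 + C*s)**alpha."""
    return A / (1.0 + C * s) ** alpha


sample = "the cat and the dog and the bird"  # toy input, for illustration only
freqs = rank_frequency(sample)
```

Here `freqs[0]` is the normalised frequency of the most frequent word, and the list sums to one by construction. On a real corpus the choice of tokenisation affects mainly the high-rank end of the list.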

It has been shown that this form of the law is also
obeyed by random processes that can be mapped onto
texts, which rules out the Zipf-Mandelbrot law as a
sufficient indicator of linguistic depth.
Nevertheless, it has been argued that it is possible to
discriminate between human writings and stochastic
versions of texts precisely by looking at statistical properties
of words that fall beyond the scope where equation (2)
holds.

In this paper two complementary goals are pursued. 
In the first place we wish to display evidence on the
statistical significance of the non-Zipfian behaviour that
emerges as a robust feature when large corpora are
considered. Second, we shall present a complete mathematical
framework by which we redefine Zipf's law in order
to describe in a precise manner the empirical frequency-rank
distributions of words in human writings over the
whole range of the rank variable. We believe that an
accurate phenomenological understanding can shed light
on the search for plausible microscopic models, as well as
dictate necessary features that those models should
eventually comply with. Moreover, the various mechanisms
that have been proposed to explain the Zipf-Mandelbrot law
do not predict any anomaly associated with high-rank
words. Hence, the genuine statistical features that
we will address in our analysis may lead to a more reliable
hallmark of complexity in human writings.
II. STATISTICAL EVIDENCE FOR A
REFORMULATION OF ZIPF'S LAW 

Generally, Zipf plots obtained for single texts suffer
from a lack of sufficient statistics in the region
corresponding to high values of the rank variable, that is, for
words that appear only a few times in the
whole text. Possibly, that may have been one of the
reasons why so little attention has been devoted so far to
the distribution of words from s ≈ 2000 onwards. To
resolve the behaviour of those words we need a signifi- 
cant increase in volume of data, probably exceeding the 
length of any conceivable single text. Still, at the same 
time it is desirable to maintain as high a degree of homo- 
geneity in the texts as possible, in the hope of revealing 
a more complex phenomenology than that simply origi- 
nating from a bulk average of a wide range of disparate 
sources. 

In the Zipf plots shown in this paper, the data
points correspond to averages over non-overlapping
windows on the rank variable, centred at the
displayed points. The windows have constant width on the
logarithmic scale, and the averaging is done in order to
smooth local fluctuations in the data.
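
The averaging just described can be sketched as follows. This is only one possible reading of the procedure: the number of windows per decade, the use of the geometric centre of each window as the plotted rank, and the function name are assumptions introduced for illustration.

```python
import math


def log_window_average(freqs, windows_per_decade=10):
    """Average f(s) over non-overlapping rank windows of constant log width.

    freqs[0] is the frequency of rank s = 1, freqs[1] of rank s = 2, and so on.
    Returns a list of (window centre on the rank axis, mean frequency) pairs.
    """
    bins = {}
    for s, f in enumerate(freqs, start=1):
        # Index of the logarithmic window that contains rank s.
        k = math.floor(math.log10(s) * windows_per_decade)
        bins.setdefault(k, []).append(f)
    points = []
    for k in sorted(bins):
        # Geometric centre of window k on the rank axis (an assumed convention).
        centre = 10.0 ** ((k + 0.5) / windows_per_decade)
        values = bins[k]
        points.append((centre, sum(values) / len(values)))
    return points


# Toy input: a pure 1/s law over ranks 1..100, one window per decade.
toy = [1.0 / s for s in range(1, 101)]
points = log_window_average(toy, windows_per_decade=1)
```

With constant logarithmic width the windows hold more and more ranks as s grows, which is what smooths the long plateaux of rare words.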
FIG. 1. Last section of a Zipf plot where we show the ac-
tual data and the averaged values we use to represent the 
data. The region of highest ranks is dominated by the long 
plateaux associated with very infrequent words. Usually the 
last one or two points ought to be discarded. The data cor- 
respond to a collection of 56 books by Charles Dickens. 

Figure 1 shows the last part of a typical Zipf plot for a 
large text corpus depicting the actual data together with 
the averaged values. The step-like plateaux are a finite-size
effect due to words that appear only a few
times, and are indeed a consequence of poor statistics
for the highest values of the rank. The last one of these
plateaux, which is also the longest, corresponds to
hapaxes, that is, words that appear just once in the text.
Therefore, in any quantitative discussion on the form of
the frequency-rank distribution some of the last points,
usually one or two, should be disregarded on the basis of
the foregoing observations.

Along these lines, in Figure 2 we show the Zipf plots
obtained for four large corpora, each gathering several
works by a single author. In the
figures, N represents the total number of tokens present in
the corpus, and V the vocabulary size. It is only when we
start to analyse large samples like these that robust
statistical features begin to emerge in the region belonging
to the higher ranks. The most conspicuous observation
is that all the curves start to depart from the power-law
regime at approximately s ≈ 2000-3000, and then all
tend towards a faster decay that is slightly different
for each curve. All the curves roughly agree in the region
where Zipf's law holds, and each again behaves
differently at the lowest ranks. In total, three regimes are
clearly distinguished in the four data sets.

It is also interesting to note that although the vocabularies
and styles vary across the four corpora considered, the
natural divisions into qualitatively different regimes look
very similar. More specifically, whereas the intervals on
the rank axis that cover the first two regimes are roughly
of the same size for all four corpora, the difference in
vocabulary size is reflected in the different lengths
of the fast-decaying tails for each corpus. This suggests
that regardless of the different sizes of the texts
considered, the vocabularies can be divided into two parts of
distinct nature [9,10]: one of basic usage whose overall
linguistic structure leads to the Zipf-Mandelbrot law, and
a second part containing more specific words with a less
flexible syntactic function.
FIG. 2. Frequency-rank distribution of words for four large
text samples. In order to reveal individual variations, each
corpus is built from literary works by a single author.
The vertical dashed line is placed approximately
where Zipf's law ceases to hold.

The question may now arise as to whether there is a
kind of asymptotic behaviour as even larger corpora are
analysed. It is clear that in order to answer this we are
compelled to relax to a certain degree the constraints on
homogeneity and consider samples from various authors
and styles. In Figure 3 we show the frequency-rank
distribution of words in a very large corpus made up of
2606 books written in English, comprising nearly 1.2 GB
of ASCII data. The total number of tokens in this case
rose to 183403300, with a vocabulary size of 448359
different words. It is remarkable that the point at which
the departure from Zipf's law takes place has only moved
to s ≈ 6000 despite the increase in sample size. However,
the striking new feature is that the distribution
for high ranks takes the form of a second power law regime.
This last result is an independent confirmation of a
similar phenomenology observed by Ferrer and Solé in a very
large corpus made up of a collection of samples of
modern English, both written and spoken, each no longer
than 45000 words. There, they found that the second
power law regime was characterised by a decay exponent
close to −2. However, a precise direct measurement of
this decay exponent poses some difficulty, since the last
portion of the Zipf plot may still be affected by the poor
statistics of very infrequent words. That may translate
into a slight underestimation of the absolute value of the
exponent. Despite this caveat, in Figure 3 we have also
shown a pure power law with decay exponent α = 2.3,
solely as a visual reference.
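
A direct measurement of a decay exponent of the kind mentioned above could be sketched as an ordinary least-squares fit of log f against log s over a chosen rank window. This is an illustrative assumption, not the method used for the figures; the function name and the synthetic data are hypothetical, and such a fit inherits the bias from poorly sampled rare words noted in the text.

```python
import math


def loglog_slope(points):
    """Least-squares slope of log f versus log s for a list of (s, f) pairs."""
    xs = [math.log(s) for s, _ in points]
    ys = [math.log(f) for _, f in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic check: an exact power law with exponent 2.3 over high ranks.
synthetic = [(s, s ** -2.3) for s in range(6000, 100000, 500)]
slope = loglog_slope(synthetic)
```

On the synthetic input the slope recovers the exponent of the generating power law, with a negative sign because the frequency decreases with rank.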
FIG. 3. Zipf plot for a large corpus comprising 2606
books in English, mostly literary works and some essays. The
straight lines in the logarithmic graph show pure power laws
as a visual aid. The last point of the Zipf plot was
eliminated since it is severely affected by the plateaux
associated with the least frequent words.

The main purpose of this section was to show that even 
though there is a restricted domain in which the univer- 
sality of Zipf's law seems to be valid, new and solid sta- 
tistical regularities emerge as the sample size is increased. 
These regularities are no less impressive than the
original observations made by G. K. Zipf and, furthermore,
they might be more deeply related to particular features
of the complex process of language generation.

In the next section we will present a mathematical 
model within which all the phenomenology discussed here 
can be quantitatively described. 



III. THE MATHEMATICAL MODEL 

We start from the simple observation that the Zipf- 
Mandelbrot law satisfies the following first order differ- 
ential equation, as can be verified by direct substitution: 



\[ \frac{df}{ds} = -\lambda f^{q} \qquad (3) \]

The solutions to equation (3) asymptotically take the
form of pure power laws with decay exponent 1/(q − 1).
It is possible to modify this expression in order to include 
a crossover to another regime, as in the following more 
general equation: 



\[ \frac{df}{ds} = -\mu f^{r} - (\lambda - \mu) f^{q} \qquad (4) \]

where we have added a new parameter and a new
exponent. In the case 1 < r < q and μ ≠ 0 the
effect of the new additions is to allow the presence of
two global regimes characterised by the dominance of
either exponent depending on the particular value of f.
The use of this equation in the realm of linguistics was
originally suggested by C. Tsallis [11], and it had
previously been used to describe experimental data on the
re-association in folded proteins [12] within the
framework of non-extensive Statistical Mechanics [13]. It is
worth mentioning here that the Zipf-Mandelbrot law for
words, equation (2), has been related to Tsallis'
generalised Thermodynamics by means of heuristic arguments
based on the fractal structure of symbolic sequences with
long-range correlations [14].

Some qualitative features of the solutions of equation
(4) can be grasped by analysing different possibilities for
the parameters involved. We summarise those in the
following three representative cases and refer the reader to
the literature cited above for a more complete discussion:

• Let us note first that by taking r = q > 1, or
equivalently μ = 0 and q > 1, we recover equation (2), the
usual form of the Zipf-Mandelbrot law, since by direct
integration we have

\[ f(s) = \frac{1}{\left[ 1 + (q-1) \lambda s \right]^{1/(q-1)}} \qquad (5) \]

where we chose f(0) = 1.
• Now, letting r = 1 and taking q > 1 one obtains

\[ f(s) = \left[ 1 - \frac{\lambda}{\mu} + \frac{\lambda}{\mu}\, e^{(q-1) \mu s} \right]^{-1/(q-1)} \qquad (6) \]

This expression shows a very interesting behaviour
for μ ≪ λ, since for small values of s it reduces
to equation (2) and then for larger values of s it
undergoes a crossover to an exponential decay.

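• Finally, let us consider the more general situation
1 < r < q. In this case the integration yields

\[ s = \frac{1}{\mu} \left\{ \frac{f^{-(r-1)} - 1}{r-1} - \frac{(\lambda/\mu) - 1}{1 + q - 2r} \left[ H(1; q-2r, q-r, (\lambda/\mu) - 1) - H(f; q-2r, q-r, (\lambda/\mu) - 1) \right] \right\} \qquad (7) \]

with the definition

\[ H(f; a, b, c) = f^{1+a}\, {}_2F_1\!\left( \frac{1+a}{b}, 1; \frac{1+a+b}{b}; -f^{b} c \right) \qquad (8) \]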
where ${}_2F_1$ is the hypergeometric function. In this
case the solution presents a crossover between two
power law regimes. By direct examination of the
right hand side of equation (7), it can be seen
that where $f \gg (\lambda/\mu - 1)^{-1/(q-r)}$, so that the $f^{q}$ term of
equation (4) dominates, the solution takes the form
$f(s) \sim [(q-1)(\lambda-\mu)s]^{-1/(q-1)}$, which is close to
$[(q-1)\lambda s]^{-1/(q-1)}$ when μ ≪ λ, and where
$f \ll (\lambda/\mu - 1)^{-1/(q-r)}$, so that the $f^{r}$ term dominates, it becomes
$f(s) \sim [(r-1)\mu s]^{-1/(r-1)}$.

In all the above discussion the constant of integration
was chosen in such a way that f(0) = 1; however, in
order to introduce a different normalisation it is possible to
include a normalising constant A as a change of scale in
the dependent variable, f → f/A.
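
As a numerical cross-check of the cases discussed above, the following sketch integrates df/ds = −μ f^r − (λ − μ) f^q from f(0) = 1 with a fixed-step Runge-Kutta scheme and compares the result with two closed forms that follow from that equation: the μ = 0 solution [1 + (q − 1)λs]^{−1/(q−1)} and the r = 1 solution [1 − λ/μ + (λ/μ) e^{(q−1)μs}]^{−1/(q−1)}. The parameter values, step count and function names are illustrative assumptions and are not fitted to any corpus.

```python
import math


def rhs(f, lam, mu, q, r):
    """Right-hand side of df/ds = -mu*f**r - (lam - mu)*f**q."""
    return -mu * f ** r - (lam - mu) * f ** q


def integrate(lam, mu, q, r, s_max, steps=10000):
    """Fourth-order Runge-Kutta integration from f(0) = 1 up to s = s_max."""
    h = s_max / steps
    f = 1.0
    for _ in range(steps):
        k1 = rhs(f, lam, mu, q, r)
        k2 = rhs(f + 0.5 * h * k1, lam, mu, q, r)
        k3 = rhs(f + 0.5 * h * k2, lam, mu, q, r)
        k4 = rhs(f + h * k3, lam, mu, q, r)
        f += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return f


def zipf_mandelbrot_limit(s, lam, q):
    """Closed form for mu = 0 (equivalently r = q)."""
    return (1.0 + (q - 1.0) * lam * s) ** (-1.0 / (q - 1.0))


def exponential_crossover(s, lam, mu, q):
    """Closed form for r = 1 and q > 1."""
    growth = math.exp((q - 1.0) * mu * s)
    return (1.0 - lam / mu + (lam / mu) * growth) ** (-1.0 / (q - 1.0))


lam, mu, q = 0.2, 0.001, 1.9  # illustrative values only
s = 100.0
numeric_r1 = integrate(lam, mu, q, 1.0, s)
numeric_mu0 = integrate(lam, 0.0, q, q, s)
```

Comparing `numeric_r1` with `exponential_crossover(s, lam, mu, q)` and `numeric_mu0` with `zipf_mandelbrot_limit(s, lam, q)` gives a quick consistency test of the two special cases. The same integrator can be run with 1 < r < q to explore the general case without evaluating hypergeometric functions.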

The analytical expression for the rank distribution in
the general case, equation (7), is rather cumbersome to
work with. However, it is possible to derive a much
simpler relation for the probability density function p_f(f).
The value of the rank for a word with a normalised
frequency of occurrence f can be written in the following
way:

\[ s(f) = \int_{f}^{+\infty} N p_f(f')\, df' \qquad (9) \]



This can be seen by noting that N p_f(f') gives the
number of words that appear with normalised frequency f',
thus the corresponding position in the rank list of a word
with frequency f equals the total number of words that
have frequency greater than or equal to f. In addition, we
can also write:

\[ s(f) = -\int_{f}^{+\infty} \frac{ds}{df'}\, df' \qquad (10) \]

Since expressions (9) and (10) hold for any value of f,
the following relation can be established:

\[ p_f(f) \propto -\frac{ds}{df} \qquad (11) \]



Moreover, the proportionality with the probability
density as a function of n = N f is straightforward since
p_f(f) = N p_n(n). Finally, all these considerations allow
us to write the following relation between the probability
densities and equation (4):

\[ p_n(n) = \frac{1}{N}\, p_f(f) \propto \frac{1}{\mu f^{r} + (\lambda - \mu) f^{q}}, \qquad f = n/N \qquad (12) \]

This result is particularly interesting in view of the
mathematical simplicity of equation (12). Thereby, in essence
we can interpret differential equation (4) as a model for
the functional form of p_n(n).
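
The empirical counterpart of p_n(n) can be sketched as the fraction of vocabulary words that occur exactly n times, to be compared with the unnormalised model density 1/(μ f^r + (λ − μ) f^q) evaluated at f = n/N. The function names and the toy input below are assumptions for illustration; on a real corpus the histogram would usually be binned logarithmically for large n.

```python
from collections import Counter


def count_spectrum(word_counts):
    """Fraction of vocabulary words that occur exactly n times, keyed by n."""
    vocab = len(word_counts)  # V, the number of distinct words
    spectrum = Counter(word_counts.values())
    return {n: k / vocab for n, k in sorted(spectrum.items())}


def model_density(f, lam, mu, q, r):
    """Unnormalised density proportional to -ds/df for the crossover equation."""
    return 1.0 / (mu * f ** r + (lam - mu) * f ** q)


# Toy input: 'a' occurs 3 times, 'b' twice, 'c' and 'd' once each.
toy_counts = Counter("a a a b b c d".split())
spectrum = count_spectrum(toy_counts)
```

In this form the two exponents are read off directly: the density behaves as f^{−q} where the f^q term dominates and as f^{−r} where the f^r term dominates.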

Now we proceed to test the phenomenological scenario
we have just laid out against empirical data from actual
text sources. In Figures 4 and 5 we can see the fit
obtained with equation (6) for two of the large text samples
already used in Figure 2. Whereas the Zipf-Mandelbrot law
would have fitted only a small percentage of the total
vocabulary present in these corpora, equation (6) captures
the behaviour of the frequency-rank distribution along
the whole range of the rank variable.



[Figure 4 plot: f(s) versus s for 36 plays by W. Shakespeare
(N = 885535, V = 23150), with the fit by equation (6).
Best fit parameters (R = 0.999): λ = 0.18375 ± 0.00064,
μ = 0.00016 ± 0.00003, q = 1.89 ± 0.016.]

FIG. 4. Frequency-rank distribution for a corpus made up
of 36 plays by William Shakespeare (circles) together with a
fit (full line) by equation (6).

In the case of the text corpus of 2606 books the general
solution (7) must be used in order to fit the
frequency-rank distribution. Alternatively, by constructing a
normalised histogram of the frequencies of appearance of
each word, we can recast the data in a form suitable to
be described by equation (12). Figure 6 shows the
probability density function p_n(n) for the corpus of 2606 books,
together with the fit obtained with equation (12),
again in very good agreement with the actual data.
Figure 7 depicts the frequency-rank distribution for the
same corpus compared with a plot of equation (7) using
the best fit parameters presented in Figure 6. In this case
the parameter that controls the second power law regime
takes the value r = 1.32, indicating an asymptotic decay
exponent close to −3. In the Figure the range of the rank
variable was extended in order to make evident the whole 
transient between the two power law regimes. Notwith- 
standing the good fit along the whole set of data points, 
it is clear that a larger text corpus would be necessary 
in order to see a fully developed power law decay in the 
highest ranks that could allow a direct measure of the 
exponent free from finite size effects. 
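
As a quick check of the exponent quoted above, and assuming that the high-rank tail follows the asymptotic power law with exponent 1/(r − 1) on the frequency-rank plot, the fitted value r = 1.32 gives:

```latex
\[
\frac{1}{r - 1} = \frac{1}{1.32 - 1} = \frac{1}{0.32} = 3.125 ,
\]
```

which is the decay exponent close to −3 referred to in the text.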
FIG. 5. Frequency-rank distribution for a corpus made up
of 56 books by Charles Dickens (circles) together with a fit
(full line) by equation (6).

FIG. 6. Probability density function p_n(n) vs. n for the
large corpus of literary English and the best fit obtained with
equation (12). The straight lines show pure power laws that
correspond to the asymptotic forms of equation (12).







FIG. 7. Actual data from the corpus of 2606 books in
English together with a plot of equation (7) with the same
parameters shown in Figure 6. A proportionality constant was
added in the fitting in order to adjust normalisation.



IV. SUMMARY AND CONCLUSIONS 

The statistical evidence we have presented in this work
has shown that beyond the universal features described in
the narrow range of validity of the Zipf-Mandelbrot law lies
a vast region along the rank axis where a
non-trivial macroscopic behaviour emerges when large text
corpora are considered. We have noticed that the
majority of the words fall in the non-Zipfian regime, showing
a systematic and robust statistical behaviour. We have
also discussed the natural division of words into two
different kinds of vocabulary, each prone to distinct
linguistic usage. For large text samples with a high degree
of homogeneity we found that words whose
rank satisfies s < 3000-4000 obey the Zipf-Mandelbrot law
regardless of text length. The rest of the words, whose
number may differ considerably, all fall in the
fast-decaying tails that we recognised in Figure 2. A more
profound study of these features could possibly shed light
on aspects of how language is used and processed by
individuals. Stepping up two orders of magnitude in text
size, a new and more interesting behaviour was noticed, in
agreement with previous studies. The analysis of a huge
corpus shows the collective use of language by a society
of individuals, and more complex features were indeed
observed, as confirmed by the second power law regime
for words beyond s ≈ 5000-10000. Moreover, all the
variants of the non-trivial phenomenology we have just
discussed could be encompassed within a single
mathematical framework that accurately accounts for all the
observed features. After the evidence supplied by this
work it seems quite plausible that there may be a deep
connection between differential equation (4) and the
actual processes underlying the generation of syntactic
language.